The objective of this project is to examine which CalEnviroScreen (CES) indicators may be associated with Cardiovascular Disease prevalence (age-adjusted rate of emergency department visits for heart attacks per 10,000). CES is a composite score of two constructs, “Pollution Burden” and “Population Characteristics”. Pollution Burden contains two sub-constructs, “Exposures” and “Environmental Effects”. Population Characteristics also contains two sub-constructs, “Sensitive Populations” and “Socioeconomic Factor”. Each sub-construct is measured by several indicators, listed below.
  1. Exposures has eight indicators as follows: Ozone (concentration), PM2.5 (concentration), Diesel PM (emissions), Drinking Water (index), Children’s Lead Risk from Housing (index), Pesticide Use (lbs/sq. mi.), Toxic Releases (RSEI toxicity weighted releases), Traffic (impacts)

  2. Environmental Effects has five indicators as follows: Cleanup Sites (weighted sites), Groundwater Threats (weighted sites), Hazardous Waste Facilities/ Generators (weighted sites), Impaired Water Bodies (number of pollutants), Solid Waste Sites/Facilities (weighted sites and facilities)

  3. Sensitive Populations has three indicators as follows: Asthma (rate per 10,000), Cardiovascular Disease (heart attacks per 10,000), Low Birth Weight (percent)

  4. Socioeconomic Factor has five indicators as follows: Educational Attainment (percent), Housing Burden (percent), Linguistic Isolation (percent), Poverty (percent), Unemployment (percent)

Cardiovascular Disease is an indicator in the “Sensitive Populations” sub-construct of the CES score and is the target variable in this project. The risk factors considered are all indicators in the other three sub-constructs: “Exposures”, “Environmental Effects”, and “Socioeconomic Factor”. CalEnviroScreen also provides a demographic profile containing householder percentages for three age categories and six racial categories. These demographic variables are also included in the analysis because they are reported at the same macro level (census tract) as the CES data. The key questions in the project are as follows.

Question 1: What is the spatial distribution of Cardiovascular Disease prevalence in the Bay Area?

Question 2: What risk factors are statistically associated with Cardiovascular Disease prevalence?

Question 3: Are there any confounding effects among key risk factors? What do these confounding effects mean for Cardiovascular Disease prevalence?

Question 4: Could the key risk factors predict Cardiovascular Disease prevalence?

Question 1: What is the spatial distribution of Cardiovascular Disease Prevalence in the Bay Area?

– Spatial distribution map of Cardiovascular Disease in the Bay Area

Results: The east and north-east sides of the Bay Area have particularly high Cardiovascular Disease prevalence.

– Spatial distribution map of Ozone in the Bay Area

Results: The Ozone distribution is much more widely spread. Some regions in the east and west parts of the Bay Area with high Ozone concentration also have high Cardiovascular Disease Prevalence, so the distributions of Ozone and Cardiovascular Disease prevalence show some degree of similarity.

Question 2: What risk factors are statistically associated with Cardiovascular Disease Prevalence?

Use of correlation coefficients to inspect association

Correlation of Exposures (predictors) with Cardiovascular Disease

Exposures (predictors): – Ozone (concentration), PM2.5 (concentration), Diesel PM (emissions), Drinking Water (index), Children’s Lead Risk from Housing (index), Pesticide Use (lbs/sq. mi.), Toxic Releases (RSEI toxicity weighted releases), Traffic (impacts)

Result: Ozone is the only indicator in the “Exposures” sub-construct whose correlation coefficient with Cardiovascular Disease Prevalence reaches 0.28. The other indicators all have correlation coefficients below 0.2, indicating little or no correlation with Cardiovascular Disease.

Correlation of Environmental Effects (predictors) with Cardiovascular Disease

Environmental Effects (predictors): – Cleanup Sites (weighted sites), Groundwater Threats (weighted sites), Hazardous Waste Facilities/Generators (weighted sites), Impaired Water Bodies (number of pollutants), Solid Waste Sites/Facilities (weighted sites and facilities)

Result: All indicators in the sub-construct “Environmental Effects” have correlation coefficients below 0.1, indicating no correlation with Cardiovascular Disease.

Correlation of Socioeconomic Factors (predictors) with Cardiovascular Disease

Socioeconomic Factors (predictors): – Educational Attainment (percent), Housing Burden (percent), Linguistic Isolation (percent), Poverty (percent), Unemployment (percent)

Result: A correlation coefficient below 0.3 is generally considered low. Linguistic Isolation and Housing Burden fall below this threshold. The other three indicators have correlation coefficients between 0.33 and 0.45, indicating a low to medium degree of correlation with Cardiovascular Disease. Educational Attainment has the highest correlation coefficient, followed by Poverty and then Unemployment.

Correlation of Demographic Factors (predictors) with Cardiovascular Disease

Demographic Factors (predictors): – under_10 (householder percentage in the census tract), Year_11_to_64, Elderly, Hispanic, White, African_American, Native_American, Asian_American, Other

Result: The correlation coefficients of the three age groups are all below 0.3. Among the racial categories, Hispanic has the largest correlation coefficient in absolute value (0.56), followed by White (-0.45) and African American (0.39).
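
The report does not name the estimator behind these coefficients; they are presumably Pearson correlations. A minimal Python sketch of that calculation follows, with the 0.3 cut-off used above as the default threshold. The function names and the toy values are illustrative assumptions and are not taken from the CES data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

def is_low_correlation(r, threshold=0.3):
    """Apply the report's rule of thumb: |r| below the threshold counts as low."""
    return abs(r) < threshold

# Toy values only (hypothetical indicator and outcome, not CES data).
indicator = [1.0, 2.0, 3.0, 4.0, 5.0]
outcome = [2.1, 3.9, 6.2, 8.1, 9.8]
r = pearson_r(indicator, outcome)
print(round(r, 3), is_low_correlation(r))
```

Comparing `abs(r)` with the threshold treats a negative coefficient such as the one reported for White the same way as a positive one of equal size.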

Use of scatter plot to inspect association – Key factors include Ozone, Education attainment, Poverty, Unemployment rate, Hispanic, White, African American

Plot of Ozone vs. Cardiovascular Disease
Plot of Education attainment vs. Cardiovascular Disease
Plot of Poverty vs. Cardiovascular Disease
Plot of Unemployment rate vs. Cardiovascular Disease
Plot of Hispanic vs. Cardiovascular Disease
Plot of White vs. Cardiovascular Disease
Plot of African American vs. Cardiovascular Disease
Use of simple regression to examine the association between each key factor and Cardiovascular Disease Prevalence – Fit a simple regression model on Ozone, Education attainment, Poverty, Unemployment rate, Hispanic, White, and African American individually

Simple regression 1 - Regress Cardiovascular Disease Prevalence on Ozone
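
The fitted output for this regression is not reproduced here. As a rough sketch of what a simple regression of prevalence on a single factor estimates, the least-squares slope, intercept, and R-squared can be computed directly. `fit_simple_regression` is a hypothetical helper, and the data are made-up placeholders rather than CES values.

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variation")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return slope, intercept, r_squared

# Placeholder data: a predictor on a small scale and an outcome rate.
predictor = [0.03, 0.04, 0.05, 0.06]
prevalence = [8.0, 10.0, 12.0, 14.0]
slope, intercept, r_squared = fit_simple_regression(predictor, prevalence)
print(slope, intercept, r_squared)
```

The slope is expressed per one unit of the predictor, so a predictor measured on a small scale produces a numerically large slope.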

Question 3: Are there any confounding effects among key risk factors?

Confounding effect for Education attainment and White

Plot of Cardiovascular Disease Prevalence vs. Education attainment, controlling for White

Result: The trend lines for the four quartile categories of White all have positive slopes, and the slopes decrease as the White percentage in the census tract increases. For the group with the highest White percentage, Education attainment is strongly correlated with Cardiovascular Disease Prevalence. As the White percentage decreases, the correlation between Education attainment and Cardiovascular Disease Prevalence becomes weaker.

Confounding effect for Poverty and Hispanic

Plot of Cardiovascular Disease Prevalence vs. Poverty, controlling for Hispanic

Result: The trend lines for the groups with the highest and second-lowest Hispanic percentages have positive and roughly equal slopes. For the group with the lowest Hispanic percentage, Poverty is less correlated with Cardiovascular Disease Prevalence. For the group with the second-highest Hispanic percentage, Poverty is the least correlated with Cardiovascular Disease Prevalence.

Confounding effect for Poverty and Unemployment rate

Plot of Cardiovascular Disease Prevalence vs. Poverty, controlling for Unemployment rate

Result: The slopes of the four trend lines are roughly equal to that of the black best-fit line. For the groups with the highest and the lowest Unemployment rates, the relation between Poverty and Cardiovascular Disease Prevalence is slightly weaker.

Confounding effect for Poverty and White

Plot of Cardiovascular Disease Prevalence vs. Poverty, controlling for White

Result: For the group with the highest percentage of White, the correlation between Poverty and Cardiovascular Disease Prevalence is stronger. For the other three groups, the correlation between Poverty and Cardiovascular Disease Prevalence becomes slightly weaker as the percentage of White in the census tract decreases.

Confounding effect for Poverty and African American

Plot of Cardiovascular Disease Prevalence vs. Poverty, controlling for African American

Result: For the group with the lowest percentage of African American, the relation between Poverty and Cardiovascular Disease Prevalence is slightly stronger. For the other three groups, the correlation between Poverty and Cardiovascular Disease Prevalence weakens as the percentage of African American increases.

Confounding effect for Unemployment rate and African American

Plot of Cardiovascular Disease Prevalence vs. Unemployment rate, controlling for African American

Result: For the group with the second-highest percentage of African American, in the range (2.59, 6.96], the Unemployment rate is less correlated with Cardiovascular Disease Prevalence. For the other three African American percentage groups, the Unemployment rate is more positively correlated with Cardiovascular Disease Prevalence.
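
The plots in this section draw one trend line per quartile group of the control variable. The report does not state how those lines were fitted, so the following Python sketch assumes ordinary least squares within each quartile group, with right-closed intervals such as (2.59, 6.96]. The helper names and the synthetic data are illustrative only.

```python
from statistics import quantiles

def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variation within the group")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def slopes_by_quartile(x, y, control):
    """Fit one slope of y on x inside each quartile group of `control`.

    Groups are ordered from the lowest to the highest quartile and use
    right-closed intervals: a value equal to a cut point falls in the
    lower group.
    """
    q1, q2, q3 = quantiles(control, n=4)
    groups = [([], []) for _ in range(4)]
    for xi, yi, ci in zip(x, y, control):
        idx = (ci > q1) + (ci > q2) + (ci > q3)  # 0..3
        groups[idx][0].append(xi)
        groups[idx][1].append(yi)
    return [ols_slope(gx, gy) for gx, gy in groups]

# Synthetic data in which the slope of y on x grows with the control variable.
control = list(range(1, 41))
x = [c % 10 for c in control]
y = [((c - 1) // 10 + 1) * xi for c, xi in zip(control, x)]
print(slopes_by_quartile(x, y, control))
```

Slopes that differ clearly across groups, as in this synthetic case, indicate that the relation between the factor and the outcome depends on the level of the control variable.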

Stacked bar charts are used to further examine confounding effects among multiple factors
– (1) Check confounding – Mean Cardiovascular Disease Prevalence: Education attainment by Hispanic (CES Demographic data)
Result: As the Hispanic percentage increases, the Mean Cardiovascular Disease Prevalence gets larger. The Education attainment groups 4-8 and 8-15 tend to have greater Mean Cardiovascular Disease Prevalence. In the census tracts with Hispanic percentage between 9-18 and 31-86, the Mean Cardiovascular Disease Prevalence increases as Education attainment increases. In contrast, for the groups with Hispanic percentage between 0-9 and 18-31, the Mean Cardiovascular Disease Prevalence decreases as Education attainment increases.
– (2) Check confounding – Mean Cardiovascular Disease Prevalence: Education attainment by Poverty
Result: As the Poverty percentage increases, the Mean Cardiovascular Disease Prevalence gets larger. As the education attainment increases, the Mean Cardiovascular Disease Prevalence gets larger, too. For the groups with higher Poverty percentage, the Mean Cardiovascular Disease Prevalence increases as Education attainment increases.
– (3) Check confounding – Mean Cardiovascular Disease Prevalence: Education attainment by White
Result: As the White percentage increases, the Mean Cardiovascular Disease Prevalence gets smaller. As Education attainment increases, the Mean Cardiovascular Disease Prevalence gets larger. For the three groups with the higher White percentages (9-18, 18-31, 31-94), the Mean Cardiovascular Disease Prevalence increases as Education attainment increases. However, for the White percentage group of 0-9, the Mean Cardiovascular Disease Prevalence decreases as Education attainment increases.
– (4) Check confounding – Mean Cardiovascular Disease Prevalence: Poverty by Hispanic
Result: As the Poverty percentage increases, the Mean Cardiovascular Disease Prevalence gets larger. For the groups with the lower Hispanic percentages (0-9, 9-18, 18-31), the Mean Cardiovascular Disease Prevalence roughly increases as Poverty increases. However, for the group with the highest Hispanic percentage, the Mean Cardiovascular Disease Prevalence is lower for the Poverty percentages of 10-17 and 17-27.
– (5) Check confounding – Mean Cardiovascular Disease Prevalence: Poverty by Unemployment rate
Result: As the Unemployment rate increases, the Mean Cardiovascular Disease Prevalence gets larger. As the Poverty percentage increases, the Mean Cardiovascular Disease Prevalence gets larger, too. For the groups with higher Unemployment rate, the Mean Cardiovascular Disease Prevalence increases as Poverty percentage increases.
– (6) Check confounding – Mean Cardiovascular Disease Prevalence: Poverty by White
Result: As the White percentage increases, the Mean Cardiovascular Disease Prevalence gets smaller. For the groups with Poverty percentage of 10-17 and 17-27, the Mean Cardiovascular Disease Prevalence decreases slightly as the White percentage increases.
– (7) Check confounding – Mean Cardiovascular Disease Prevalence: Poverty by African_American
Result: As the African American percentage increases, the Mean Cardiovascular Disease Prevalence gets larger. For three groups with the African American percentage 0-1, 1-3, 3-37, the Mean Cardiovascular Disease Prevalence increases as Poverty increases. However, for the group with the highest African American percentage, the Mean Cardiovascular Disease Prevalence decreases as Poverty increases.
– (8) Check confounding – Mean Cardiovascular Disease Prevalence: African American by Unemployment rate
Result: As the Unemployment rate increases, the Mean Cardiovascular Disease Prevalence gets larger. As the African American percentage increases, the Mean Cardiovascular Disease Prevalence gets larger, too. For the groups with higher Unemployment rate, the Mean Cardiovascular Disease Prevalence increases as African American percentage increases.
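
The quantity behind each stacked bar is a mean of Cardiovascular Disease Prevalence within a cell defined by two binned factors. A minimal Python sketch of that aggregation is shown below. The Hispanic bin edges (0, 9, 18, 31, 86) are the ones quoted in result (1); the Education attainment edges, the helper names, and the rows are assumptions made for illustration.

```python
from bisect import bisect_left
from collections import defaultdict

def bin_label(value, edges):
    """Label the right-closed bin (lo, hi] that contains `value`.

    Values at or below the first edge go to the first bin and values
    above the last edge are clamped into the last bin.
    """
    idx = bisect_left(edges, value)
    idx = min(max(idx, 1), len(edges) - 1)
    return f"{edges[idx - 1]}-{edges[idx]}"

def mean_by_two_factors(rows, edges_a, edges_b):
    """Mean outcome per (bin of factor a, bin of factor b).

    `rows` is an iterable of (factor_a, factor_b, outcome) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for a, b, outcome in rows:
        key = (bin_label(a, edges_a), bin_label(b, edges_b))
        sums[key][0] += outcome
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

# Hypothetical rows: (education attainment %, Hispanic %, prevalence).
rows = [(2, 5, 10.0), (3, 7, 14.0), (10, 40, 20.0)]
education_edges = [0, 4, 8, 15, 60]   # assumed; only 4-8 and 8-15 are quoted
hispanic_edges = [0, 9, 18, 31, 86]   # bins quoted in result (1)
print(mean_by_two_factors(rows, education_edges, hispanic_edges))
```

Comparing the cell means across one factor while holding the other factor's bin fixed is what the results above describe in words.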

Question 4: Could the key risk factors predict the Cardiovascular Disease Prevalence?

Use of Logistic regression to predict the Cardiovascular Disease Prevalence

Estimate the logit model using glm. The Y value is the log(odds) of a tract being flagged as high Cardiovascular Disease Prevalence; predictors: Hispanic, African American, Ozone
## 
## Call:
## glm(formula = cardio_flag ~ Hispanic + African_American + Ozone, 
##     family = quasibinomial(), data = ces4_bay_all_Demographic_logit)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7231  -0.8113   0.1475   0.8667   2.1319  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -6.587790   0.582108 -11.317   <2e-16 ***
## Hispanic           0.063452   0.004922  12.891   <2e-16 ***
## African_American   0.099000   0.010745   9.214   <2e-16 ***
## Ozone            130.988416  15.471123   8.467   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 1.043026)
## 
##     Null deviance: 2093.2  on 1509  degrees of freedom
## Residual deviance: 1540.9  on 1506  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5
Result: Based on the summary of the estimated logit model, Hispanic, African American, and Ozone are all significant, with p-values less than 0.001. Ranked by the size of the t value, the order is Hispanic (12.891) > African American (9.214) > Ozone (8.467).

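Odds ratios of Cardiovascular Disease Prevalence for each factor – taking exp() of each GLM coefficient gives the multiplicative change in the odds per one-unit increase in that predictor; exp() of the intercept gives the baseline odds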

##      (Intercept)         Hispanic African_American            Ozone 
##     1.377079e-03     1.065508e+00     1.104066e+00     7.718734e+56
Probability scale for each term – each value above converted as odds / (1 + odds)
##      (Intercept)         Hispanic African_American            Ozone 
##      0.001375186      0.515857567      0.524729756      1.000000000
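
Exponentiating a logit coefficient gives the odds ratio for a one-unit increase in that predictor. The size of the Ozone coefficient suggests its observed range is much narrower than one unit, in which case a smaller step is easier to interpret. The Python sketch below reuses the coefficients printed by the R glm summary above. The step of 0.01 and the tract values passed to `predicted_probability` are assumptions chosen for illustration.

```python
import math

# Coefficients copied from the glm summary above.
coefs = {
    "(Intercept)": -6.587790,
    "Hispanic": 0.063452,
    "African_American": 0.099000,
    "Ozone": 130.988416,
}

def odds_ratio(beta, step=1.0):
    """Multiplicative change in the odds for an increase of `step` units."""
    return math.exp(beta * step)

def predicted_probability(coefs, values):
    """Inverse-logit of the linear predictor for one set of predictor values."""
    eta = coefs["(Intercept)"] + sum(coefs[name] * v for name, v in values.items())
    return 1.0 / (1.0 + math.exp(-eta))

print(odds_ratio(coefs["Hispanic"]))           # per 1 percentage point
print(odds_ratio(coefs["African_American"]))   # per 1 percentage point
print(odds_ratio(coefs["Ozone"], step=0.01))   # per 0.01 units (assumed step)

# Hypothetical tract, not taken from the data.
tract = {"Hispanic": 20.0, "African_American": 5.0, "Ozone": 0.04}
print(predicted_probability(coefs, tract))
```

A probability is only defined for a full set of predictor values, which is why `predicted_probability` combines the intercept with all three terms instead of converting each coefficient separately.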

– Prediction of logistic regression

– 2*2 Table
##       
## .      FALSE TRUE
##   No     590  159
##   Yes    226  535

Result: Reading the rows as the observed classes and the columns as the predicted classes, 535/(159+535) = 0.7709 of the tracts predicted as “high” are truly high (positive predictive value), and 590/(590+226) = 0.7230 of the tracts predicted as “low” are truly low (negative predictive value). The Type I error (false positive rate) is 159/(590+159) = 0.2123 and the Type II error (false negative rate) is 226/(226+535) = 0.2970, so both error rates of the estimated logit model are high. Equivalently, 70.30% (535/761) of the census tracts with Cardiovascular Disease Prevalence greater than the median are correctly identified as “high” Cardiovascular Disease Prevalence areas by the logistic model, and 78.77% (590/749) of the census tracts with Cardiovascular Disease Prevalence less than the median are correctly identified as “low” Cardiovascular Disease Prevalence areas.
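
The rates that can be derived from the 2*2 table are easy to confuse, so the following Python sketch recomputes them from the four counts. It assumes the rows (No/Yes) are the observed classes and the columns (FALSE/TRUE) are the model's predictions; `classification_rates` is a hypothetical helper.

```python
def classification_rates(tn, fp, fn, tp):
    """Standard rates from a 2x2 table of observed vs. predicted classes."""
    actual_neg = tn + fp
    actual_pos = fn + tp
    return {
        "sensitivity": tp / actual_pos,            # true positive rate
        "specificity": tn / actual_neg,            # true negative rate
        "false_positive_rate": fp / actual_neg,    # Type I error
        "false_negative_rate": fn / actual_pos,    # Type II error
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
    }

# Counts from the table above: row "No" = (590, 159), row "Yes" = (226, 535).
rates = classification_rates(tn=590, fp=159, fn=226, tp=535)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

Sensitivity and the Type II error always sum to one, as do specificity and the Type I error, which gives a quick consistency check on any reported set of rates.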

The key factors derived from the correlation analysis, simple regression, and multiple regression are used as predictors to classify census tracts as high or low Cardiovascular Disease Prevalence. Other important factors not considered here could be included in the model to further improve its prediction accuracy.